Exploiting Fine-Grain Parallelism in Recursive LU Factorization
Authors
Abstract
The LU factorization is an important numerical algorithm for solving systems of linear equations in science and engineering and is characteristic of many dense linear algebra computations. It has even become the de facto numerical algorithm implemented within the LINPACK benchmark to rank the most powerful supercomputers in the world, listed on the TOP500 website. In this context, the challenge in developing new algorithms for the scientific community resides in the combination of two goals: achieving high performance and maintaining the accuracy of the numerical algorithm. This paper proposes a novel approach for computing the LU factorization in parallel on multicore architectures, which not only improves the overall performance but also sustains the numerical quality of the standard LU factorization algorithm with partial pivoting. While the update of the trailing submatrix is computationally intensive and highly parallel, the inherently problematic portion of the LU factorization is the panel factorization, due to its memory-bound characteristic as well as the atomicity of selecting the appropriate pivots. We present a new approach to the LU factorization of (narrow and tall) panel submatrices: a parallel, fine-grained, recursive formulation of the factorization based on conflict-free partitioning of the data and lockless synchronization mechanisms. As a result, our implementation lets the overall computation naturally flow without contention. Our recursive panel factorization provides the necessary performance increase for the inherently problematic portion of the LU factorization of square matrices. The reason is that even though the panel...
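The panel kernel described above operates on a tall, narrow block of columns. As a purely illustrative sketch (not the paper's implementation, which partitions the panel data across threads and synchronizes without locks), the following sequential Python/NumPy code shows the recursive formulation of panel LU with partial pivoting that such a kernel builds on; the names rec_panel_lu and lu_panel are hypothetical.

    import numpy as np

    def rec_panel_lu(A, piv, j0, nb):
        """Recursively factor the panel A[j0:, j0:j0+nb] in place with partial
        pivoting, storing the multipliers of L below the diagonal and U on and
        above it, and recording the row swaps in piv[j0:j0+nb]."""
        if nb == 1:
            # Base case: one column -- search for the pivot, swap rows, scale.
            p = j0 + int(np.argmax(np.abs(A[j0:, j0])))
            piv[j0] = p
            if p != j0:
                A[[j0, p], :] = A[[p, j0], :]
            if A[j0, j0] != 0.0:
                A[j0 + 1:, j0] /= A[j0, j0]
            return
        nl = nb // 2
        # Factor the left half of the panel recursively (row swaps are applied
        # to full rows, so the right half is pivoted as we go).
        rec_panel_lu(A, piv, j0, nl)
        # Triangular solve: U12 = L11^{-1} * A12 (a trsm in BLAS terms).
        L11 = np.tril(A[j0:j0 + nl, j0:j0 + nl], -1) + np.eye(nl)
        A[j0:j0 + nl, j0 + nl:j0 + nb] = np.linalg.solve(
            L11, A[j0:j0 + nl, j0 + nl:j0 + nb])
        # Trailing update of the right half: A22 -= L21 * U12 (a gemm).
        A[j0 + nl:, j0 + nl:j0 + nb] -= (
            A[j0 + nl:, j0:j0 + nl] @ A[j0:j0 + nl, j0 + nl:j0 + nb])
        # Factor the right half recursively.
        rec_panel_lu(A, piv, j0 + nl, nb - nl)

    def lu_panel(A):
        """Factor a tall panel A (m x n, m >= n) in place; return pivot rows."""
        n = A.shape[1]
        piv = np.zeros(n, dtype=int)
        rec_panel_lu(A, piv, 0, n)
        return piv

    # Quick sanity check: P*A0 == L*U for a random 8-by-4 panel.
    rng = np.random.default_rng(0)
    A0 = rng.standard_normal((8, 4))
    A = A0.copy()
    piv = lu_panel(A)
    m, n = A.shape
    L = np.tril(A, -1) + np.eye(m, n)
    U = np.triu(A[:n, :])
    PA = A0.copy()
    for j, p in enumerate(piv):
        PA[[j, p], :] = PA[[p, j], :]
    assert np.allclose(L @ U, PA)

The memory-bound, latency-sensitive steps in this recursion are the pivot searches and row swaps; these are presumably what the conflict-free data partitioning and lockless synchronization of the paper's parallel formulation target.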
Similar Resources
Distributed Filaments: Efficient Fine-grain Parallelism on a Cluster of Workstations
A fine-grain parallel program is one in which processes are typically small, ranging from a few to a few hundred instructions. Fine-grain parallelism arises naturally in many situations, such as iterative grid computations, recursive fork/join programs, the bodies of parallel FOR loops, and the implicit parallelism in functional or dataflow languages. It is useful both to describe massively parall...
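For a concrete, if simplistic, picture of the fine-grain style this entry describes (many small task bodies), here is a parallel-FOR-loop style sketch in Python. It is not from the Filaments work, the helper parallel_for_sum is invented, and because of CPython's GIL the threads only illustrate the structure rather than deliver speedup on this pure-Python loop body.

    from concurrent.futures import ThreadPoolExecutor
    import math

    def parallel_for_sum(data, workers=4, tasks_per_worker=8):
        """Parallel FOR-loop flavour of fine-grain parallelism: slice the
        iteration space into many small chunks, run each chunk body as a tiny
        task, then reduce the partial results (the join)."""
        chunk = max(1, math.ceil(len(data) / (workers * tasks_per_worker)))
        chunks = [data[i:i + chunk] for i in range(0, len(data), chunk)]
        with ThreadPoolExecutor(max_workers=workers) as pool:
            partials = pool.map(sum, chunks)   # fork the fine-grain bodies
        return sum(partials)                   # join

    assert parallel_for_sum(list(range(100_000))) == sum(range(100_000))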
The Expandable Split Window Paradigm for Exploiting Fine-Grain Parallelism
We propose a new processing paradigm, called the Expandable Split Window (ESW) paradigm, for exploiting fine-grain parallelism. This paradigm considers a window of instructions (possibly having dependencies) as a single unit, and exploits fine-grain parallelism by overlapping the execution of multiple windows. The basic idea is to connect multiple sequential processors, in a decoupled and decent...
Scheduling dense linear algebra operations on multicore processors
State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to their inability to fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workl...
Scheduling Linear Algebra Operations on Multicore Processors
State-of-the-art dense linear algebra software, such as the LAPACK and ScaLAPACK libraries, suffers performance losses on multicore processors due to their inability to fully exploit thread-level parallelism. At the same time, the coarse-grain dataflow model gains popularity as a paradigm for programming multicore architectures. This work looks at implementing classic dense linear algebra workloa...
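To make the coarse-grain dataflow idea referred to in these two entries concrete, here is a toy dependency-driven scheduler in Python. It is not the runtime used in those works; the run_dataflow helper and the task names are invented. Each task is released for execution only once everything it depends on has completed, which is how tile operations such as a panel factorization and the updates that consume its output can be ordered without global synchronization.

    from concurrent.futures import ThreadPoolExecutor, wait, FIRST_COMPLETED

    def run_dataflow(tasks, deps, workers=4):
        """Toy dataflow scheduler: `tasks` maps a name to a zero-argument
        callable, `deps` maps a name to the set of names it must wait for.
        A task is submitted as soon as all of its dependencies are done."""
        done, running = set(), {}
        with ThreadPoolExecutor(max_workers=workers) as pool:
            while len(done) < len(tasks):
                for name, fn in tasks.items():
                    if name not in done and name not in running \
                            and deps.get(name, set()) <= done:
                        running[name] = pool.submit(fn)
                finished, _ = wait(running.values(),
                                   return_when=FIRST_COMPLETED)
                for name in [n for n, f in running.items() if f in finished]:
                    running.pop(name).result()   # re-raise task errors, if any
                    done.add(name)

    # Toy pipeline in the spirit of a tiled factorization: two trailing
    # updates depend on the first panel task; the next panel depends on one
    # of those updates.
    log = []
    tasks = {
        "panel(0)":    lambda: log.append("panel(0)"),
        "update(0,1)": lambda: log.append("update(0,1)"),
        "update(0,2)": lambda: log.append("update(0,2)"),
        "panel(1)":    lambda: log.append("panel(1)"),
    }
    deps = {
        "update(0,1)": {"panel(0)"},
        "update(0,2)": {"panel(0)"},
        "panel(1)":    {"update(0,1)"},
    }
    run_dataflow(tasks, deps)
    assert log.index("panel(0)") < log.index("update(0,1)") < log.index("panel(1)")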
A Feasibility Study of Hierarchical Multithreading
Many studies have shown that significant amounts of parallelism exist at different granularities. Execution models such as superscalar and VLIW exploit parallelism from a single thread. Multithreaded processors take a step towards exploiting parallelism from different threads, but are not geared to exploit parallelism at different granularities (fine and medium grain). In this paper we present ...